Abstract

Background: Computational approaches to transcription factor binding site identification have been actively 
researched in the past decade. Learning from known binding sites, new binding sites of a transcription factor in 
unannotated sequences can be identified. A number of search methods have been introduced over the years. 
However, one can rarely find one single method that performs the best on all the transcription factors. Instead, to 
identify the best method for a particular transcription factor, one usually has to compare a handful of methods. Hence, 
it is highly desirable for a method to perform automatic optimization for individual transcription factors. 

Results: We proposed to search for transcription factor binding sites in vector spaces. This framework allows us to 
identify the best method for each individual transcription factor. We further introduced two novel methods, the 
negative-to-positive vector (NPV) and optimal discriminating vector (ODV) methods, to construct query vectors to 
search for binding sites in vector spaces. Extensive cross-validation experiments showed that the proposed methods 
significantly outperformed the ungapped likelihood under positional background method, a state-of-the-art method, 
and the widely-used position-specific scoring matrix method. We further demonstrated that motif subtypes of a TF 
can be readily identified in this framework and two variants called the kNPV and kODV methods benefited 
significantly from motif subtype identification. Finally, independent validation on ChIP-seq data showed that the ODV 
and NPV methods significantly outperformed the other compared methods. 

Conclusions: We conclude that the proposed framework is highly flexible. It enables the two novel methods to 
automatically identify a TF-specific subspace to search for binding sites. Implementations are available as source code 
at: http://biogrid.engr.uconn.edu/tfbs_search/. 



Background 

Transcription of genes followed by translation of their 
transcripts into proteins determines the type and functions
of a cell. Expression of certain genes even initiates
or suppresses differentiation of stem cells. It is therefore
crucial to understand the mechanisms of transcrip- 
tional regulation. Among them, transcription factor (TF) 
binding is the one that has been given considerable atten- 
tion by computational biologists for the past decade and is 
still being actively researched. A TF is a protein or protein 
complex that regulates transcription of one or more genes 
by binding to the double-stranded DNA. A first step in 
computational identification of target genes regulated by 
a TF is to pinpoint its binding sites in the genome. Once 
the binding sites are found, the putative target genes can 
be searched and located in flanking regions of the binding 
sites. 

In general, there are two approaches to computational 
transcription factor binding site (TFBS) identification, 
motif discovery and TFBS search. The former assumes 
that a set of sequences is given and each of the sequences 
may or may not contain TFBSs. An algorithm then pre- 
dicts the locations and lengths of TFBSs. The term motif 
refers to the pattern that is shared by the discovered 
TFBSs. These algorithms rely on no prior knowledge of 
the motif and hence are known as de novo motif discovery 
algorithms. The latter assumes that, in addition to a set 
of sequences, the locations and lengths of TFBSs are 
known. An algorithm then learns from these examples and 
predicts TFBSs in new sequences. Such algorithms are 
also called supervised learning algorithms since they are 
guided by the given sequences with known TFBSs. Considerable
effort has been devoted to the de novo motif discovery 
problem [1-11]. Comprehensive evaluation and compari- 
son of the developed tools have been performed [12,13]. 
In this study, we focus on the problem of TFBS search. We 
refer readers interested in the motif discovery problem to 
the evaluation and review articles [12-14] and references 
therein. 

A typical TFBS search method searches for the binding 
sites of a particular transcription factor in the following 
manner. It scans a target DNA sequence and compares
each length-l subsequence (l-mer) to the binding site
profile of the TF, where l is the length of a binding site.
Each l-mer is scored by comparing it to the profile.
A cut-off score is then set by the method to select
candidate TF binding sites. The position-specific scoring
matrix (PSSM) [15] is a widely used profile representation,
where the binding sites of a TF are encoded as a
4 × l matrix. Column i of the matrix stores the scores
of matching the i-th letter in an l-mer to nucleotides A,
C, G and T, respectively. Depending on the method of
choice, the score of A at position i can be the count of A at
position i in the known TFBSs, the log-transformed probability
of observing A at position i, or any other reasonable
number. Once computed, the scoring matrix of a TF can 
be stored in a database. These matrices are used by tools 
[16-21] to scan sequences for TFBSs. 

One assumption the PSSM representation makes is that 
positions in a binding site are independent, which is 
often not the case. Osada et al. [22] exploited dependence 
between positions by considering nucleotide pairs in scor- 
ing methods. It was shown that incorporating nucleotide 
pairs significantly improved the performance of a method, 
meaning that most transcription factors studied demon- 
strated correlation between positions in a binding site. 
This result was reinforced in a recent study [23], in which 
the authors showed correlations between two nucleotides 
within a binding site by plotting the mutual informa- 
tion matrix. A novel scoring method called the ungapped 
likelihood under positional background (ULPB) method 
was proposed in this study. The ULPB method models a 
TFBS by two first-order Markov chains and scores a candidate
binding site by the likelihood ratio produced by the two
Markov chains. Leave-one-out cross-validation results on 
22 TFs with 20 or more binding sites showed that ULPB is 
superior to the methods compared in their work. 

In this work, we approach the TFBS search problem 
from a different perspective. We propose to search for 
binding sites in vector spaces. Specifically, l-mers are
placed in the Euclidean space such that each l-mer corresponds
to a vector in the space. With known binding 
sites of a TF, we construct a profile vector for the TF. 
This profile vector can then be used as a query vector to 
search for the unknown binding sites in the space given a 
similarity measure between two vectors. The vector space
model has long been used in information retrieval (IR)
[24,25]. Under this model, each document in a collection
is embedded in a t-dimensional space. That is, each
document is represented by a t-element vector, where t
is the number of distinct terms present in the document
collection or corpus. To search for documents on a particular
topic, a query composed of terms relevant to the
topic is constructed. The query can be similarly embedded
in the t-dimensional space. Similarity between the query
and a document can then be measured by measuring the
similarity between the two corresponding vectors. In the
TFBS search problem, the entire genome or the collection
of promoter region sequences corresponds to the corpus,
whereas an l-mer is analogous to a document in IR. On
the other hand, a TF is analogous to a topic, while a TF
representation is the analog of a query for the topic. 

In this framework, we propose two novel approaches 
to constructing a query vector for a TF of interest. 
We compare the proposed methods to a state-of-the-art 
method, the ULPB method, as well as the widely-used 
PSSM method. Performance of a method is assessed by 
cross-validation experiments on two data sets collected 
from RegulonDB [26] and JASPAR [27], respectively. Independent
validation on human ChIP-seq data gives further 
insights into the proposed methods. Finally, we discuss 
the advantages of searching for TF binding sites in the 
proposed framework. 

The paper is organized as follows. In Methods, we 
present the novel negative-to-positive vector and optimal 
discriminating vector methods, in addition to introduc- 
ing the existing methods compared in this work. Cross- 
validation results on prokaryotic and eukaryotic tran- 
scription factors are presented and discussed in Results 
and Discussion. Finally, we give the concluding remarks in 
Conclusions. 

Methods 

Data sets 

To understand the compared methods in this work, we 
experimented on prokaryotic as well as eukaryotic tran- 
scription factors. The known prokaryotic TF binding sites 
were collected from RegulonDB [26] release 6.8. 
Considered in [23], this data source contains binding sites 
of TFs in the E. coli K-12 genome. We considered a data 
set of 26 TFs with 17 or more known binding sites. The fil- 
tering criterion ensures that, for each TF, we have enough 
examples to learn from. Similar filtering criteria were used 
in [23]. This data set is summarized in Table 1. 

The known eukaryotic TF binding sites were collected 
from the JASPAR CORE database (the 4th release) [27]. TFs 
of Homo sapiens and Mus musculus were filtered by two 
criteria. A TF was kept only if it has at least 20 known 
binding sites and the length of its binding sites is at least 
6 nucleotides. The length criterion, arbitrarily chosen, 
Table 1 Statistics of the E. coli TFs in RegulonDB

Name    Length  #TFBSs    Name    Length  #TFBSs
MetJ    8       29        Lrp     12      62
SoxS    18      19        H-NS    15      37
FlhDC   16      20        AraC    18      20
Fis     15      206       ArcA    15      93
IHF     13      101       OmpR    20      22
PhoB    20      17        GlpR    20      23
OxyR    17      41        CpxR    15      37
NarL    7       90        CRP     22      249
TyrR    18      19        NarP    7       20
Fur     19      81        LexA    20      40
NtrC    17      17        FNR     14      87
MalT    10      20        PhoP    17      21
ArgR    18      32        NsrR    11      37

ensures a TF under consideration is specific enough. This 
data set is summarized in Table 2. 

Notation 

For clarity, we list and define functions and variables used 
throughout this paper. Please see Additional file 1 for 
more details. 

• $f_i(u)$ denotes the probability of observing letter $u$ at
position $i$ of a TFBS, where $u \in \{A, C, G, T\}$.

• $f_{i,j}(u, v)$ denotes the probability of observing letters $u$
and $v$ at positions $i$ and $j$, respectively, where $i < j$
and $u, v \in \{A, C, G, T\}$.

• $f_i(v \mid u)$ denotes the position-specific conditional
probability of observing $v$ at position $i + 1$ given $u$
has been seen at position $i$, where $u, v \in \{A, C, G, T\}$.

• $f(v \mid u)$ denotes the background conditional
probability of observing $v$ given $u$ has been observed
at the previous position, where $u, v \in \{A, C, G, T\}$.

• $I_u(\cdot)$ is the indicator function given by

$$I_u(v) = \begin{cases} 1 & \text{if } v = u, \\ 0 & \text{otherwise,} \end{cases}$$

where $u, v \in \{A, C, G, T\}$.

• $I_{u_1 u_2}(\cdot)$ is similarly defined as follows:

$$I_{u_1 u_2}(v_1 v_2) = \begin{cases} 1 & \text{if } v_1 = u_1 \text{ and } v_2 = u_2, \\ 0 & \text{otherwise,} \end{cases}$$

where $u_1, u_2, v_1, v_2 \in \{A, C, G, T\}$.

• $\mathrm{IC}_i$ denotes the information content at position $i$ of a
binding site. Information content is closely related to
entropy, a measure of uncertainty in information
theory. The entropy at position $i$ is given by
$E_i = -\sum_{u \in \{A, C, G, T\}} f_i(u) \log_2 [f_i(u)]$. When
$f_i(u) = \frac{1}{4}$ for all $u \in \{A, C, G, T\}$, $E_i$ attains the


Table 2 Statistics of TFs in the JASPAR database

Mus musculus

ID          Name           Length  #TFBSs
MA0039.2    Klf4           10      4336
MA0047.2    Foxa2          12      809
MA0062.2    GABPA          11      87
MA0065.2    PPARG::RXRA    15      839
MA0104.2    Mycn           26      85
MA0141.1    Esrrb          12      3613
MA0142.1    Pou5f1         15      1332
MA0143.1    Sox2           15      666
MA0144.1    Stat3          19      830
MA0145.1    Tcfcp2l1       14      3931
MA0146.1    Zfx            20      477
MA0147.1    Myc            10      682
MA0154.1    EBF1           10      21

Homo sapiens

ID          Name           Length  #TFBSs
MA0037      GATA3          6       20
MA0052      MEF2A          10      31
MA0077      SOX9           9       45
MA0080.2    SPI1           7       35
MA0083      SRF            12      26
MA0112.2    ESR1           20      472
MA0115      NR1H2::RXRA    17      22
MA0137.2    STAT1          15      2082
MA0138      REST           19      22
MA0138.2    REST           11      871
MA0139.1    CTCF           11      944
MA0148.1    FOXA1          11      896
MA0149.1    EWSR1-FLI1     17      101
MA0159.1    RXR::RAR_DR5   17      23
MA0258.1    ESR2           18      356

maximal entropy of 2 and we are most uncertain
about the letter at position $i$. $\mathrm{IC}_i$ is simply defined as

$$\mathrm{IC}_i = 2 - E_i = 2 + \sum_{u \in \{A, C, G, T\}} f_i(u) \log_2 [f_i(u)]. \tag{3}$$

• $\mathrm{IC}_{i,j}$ denotes the information content of the position
pair $(i, j)$ of a binding site. Similarly,

$$\mathrm{IC}_{i,j} = 4 + \sum_{u, v \in \{A, C, G, T\}} f_{i,j}(u, v) \log_2 [f_{i,j}(u, v)], \tag{4}$$

where the maximal entropy of 4 is attained when
$f_{i,j}(u, v) = \frac{1}{16}$ for all $u, v \in \{A, C, G, T\}$.

Embedding short sequences in vector spaces 

We describe how a short sequence of $l$ nucleotides, or
an l-mer, is placed in a vector space. Let $s$ be an l-mer
and $s_i$ denote its $i$-th nucleotide. Each nucleotide in
$s$ is converted to 4 variables, that is, $s_i$ is converted to
$w_i I_A(s_i)$, $w_i I_C(s_i)$, $w_i I_G(s_i)$ and $w_i I_T(s_i)$ for $i = 1, 2, \ldots, l$.
Hence, $s$ is converted to $4l$ variables, placing $s$ in $\mathbb{R}^{4l}$.
Figure 1 illustrates the conversion of each nucleotide in an
l-mer to 4 variables when $w_i = 1$ for $i = 1, 2, \ldots, l$.

We further consider nucleotide pair $(s_i, s_j)$, where
$i < j$. Only pairs in close proximity are considered
in this study. We consider $(s_i, s_j)$ only if $j - i = 1$
or $2$, i.e., a pair of nucleotides is considered only
if they are adjacent or separated by one nucleotide.
Nucleotide pair $(s_i, s_j)$ is similarly converted to 16 variables,
$w_{i,j} I_{AA}(s_i s_j)$, $w_{i,j} I_{AC}(s_i s_j)$, ..., $w_{i,j} I_{TT}(s_i s_j)$, as there
are 16 possible nucleotide pairs, {AA, AC, ..., TT}. We
use $32l - 48$ additional variables to encode the pairs since
there are $l - 1$ adjacent pairs and $l - 2$ pairs separated
by one nucleotide, giving $16(l - 1) + 16(l - 2) = 32l - 48$.
Consequently, considering individual
nucleotides and nucleotide pairs, each l-mer is converted
to a $(36l - 48)$-element vector.

In this study, we consider two choices of the $w_i$'s and $w_{i,j}$'s.
For the first choice, all the nucleotides and nucleotide
pairs are given the same weight, i.e., $w_i = 1$ and $w_{i,j} = 1$
for all $i$ and $j$. The second one assigns weight to the
$i$-th nucleotide according to the information content at
position $i$. Similarly, it assigns weight to pair $(i, j)$ according
to the information content at this pair of positions.
Specifically, $w_i = \sqrt{\mathrm{IC}_i}$ and $w_{i,j} = \sqrt{\mathrm{IC}_{i,j}}$ for all $i$ and $j$.
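
The embedding can be illustrated with a short Python sketch. It is a minimal illustration rather than the authors' implementation: the function name `embed`, the 0-based position indices, and the ordering of the pair blocks (all adjacent pairs first, then all pairs separated by one nucleotide) are assumptions made here for concreteness. Weights default to 1, which corresponds to the first weighting choice.

```python
from itertools import product

NUCS = "ACGT"
PAIRS = ["".join(p) for p in product(NUCS, repeat=2)]  # AA, AC, ..., TT


def embed(lmer, w=None, w_pair=None):
    """Embed an l-mer (l >= 2) as a (36l - 48)-element vector.

    w[i] weights nucleotide i and w_pair[(i, j)] weights pair (i, j),
    using 0-based positions. Missing weights default to 1.
    The block layout is an assumption of this sketch:
    [4l single variables | 16(l-1) adjacent pairs | 16(l-2) gapped pairs].
    """
    l = len(lmer)
    w = w or [1.0] * l
    w_pair = w_pair or {}
    vec = []
    # 4l indicator variables for the individual nucleotides
    for i, c in enumerate(lmer):
        vec.extend(w[i] if c == u else 0.0 for u in NUCS)
    # 16 indicator variables per pair: gap 1 (adjacent), then gap 2
    for gap in (1, 2):
        for i in range(l - gap):
            j = i + gap
            pair = lmer[i] + lmer[j]
            wij = w_pair.get((i, j), 1.0)
            vec.extend(wij if pair == uv else 0.0 for uv in PAIRS)
    return vec
```

With unit weights, each nucleotide and each considered pair contributes exactly one non-zero entry, so a 6-mer yields a 168-element vector with 6 + 5 + 4 = 15 ones.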

Searching for TFBSs in vector spaces 

Given a query vector $\mathbf{t}$ in space, we score an l-mer $s$ as
follows:

$$\mathrm{Score}(s) = \mathbf{s}^T \mathbf{t}, \tag{5}$$

where $\mathbf{s}$ denotes the corresponding vector of $s$. In other
words, the score of $s$ is obtained by taking the dot product
between $\mathbf{s}$ and $\mathbf{t}$. It can be seen that Score($s$) measures the
similarity between $\mathbf{s}$ and $\mathbf{t}$. Assuming that $\mathbf{t}$ corresponds to
an l-mer $t$, Score($s$) counts the number of nucleotides and
nucleotide pairs shared between $s$ and $t$ when $w_i = 1$ and


Figure 1 Illustration of embedding a short sequence in vector
space. Each nucleotide in the sequence is converted to 4 indicator
variables.



$w_{i,j} = 1$ for all $i$ and $j$. However, we note that $\mathbf{t}$ can be any
vector in the space and does not necessarily correspond to
an l-mer.

As described above, an l-mer is converted to a $(36l - 48)$-element
vector. Hence, we use $\mathbf{t}$ to search for binding sites
in $\mathbb{R}^{36l-48}$. Our approach offers great flexibility in that it
easily allows searching for binding sites in a lower dimensional
subspace. By setting all but the first $4l$ elements in
$\mathbf{t}$ to zero, we are essentially searching for binding sites in
$\mathbb{R}^{4l}$. In this work, we exploit this advantage and simultaneously
search for transcription factor binding sites in
three subspaces. Two of them are $\mathbb{R}^{4l}$ and $\mathbb{R}^{36l-48}$. The
third one is $\mathbb{R}^{16l-12}$. This subspace is obtained from considering
only the first nucleotide and the $l - 1$ adjacent
nucleotide pairs as in a first-order Markov chain.
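
The following Python sketch shows how scoring by dot product and subspace restriction by zeroing elements of the query vector might look. The helper names (`score`, `subspace_mask`, `restrict`) and the subspace labels are hypothetical, and the vector layout (single-nucleotide block, then adjacent pairs, then pairs separated by one nucleotide) is an assumption of this sketch.

```python
def score(s_vec, t_vec):
    """Score an embedded l-mer by its dot product with the query vector."""
    return sum(a * b for a, b in zip(s_vec, t_vec))


def subspace_mask(l, name):
    """Return a 0/1 mask over the (36l - 48) coordinates.

    Assumed layout: [4l singles | 16(l-1) adjacent pairs | 16(l-2) gapped pairs].
    """
    n_single, n_adj, n_gap = 4 * l, 16 * (l - 1), 16 * (l - 2)
    if name == "single":    # R^{4l}: individual nucleotides only
        parts = [1] * n_single + [0] * n_adj + [0] * n_gap
    elif name == "markov":  # R^{16l-12}: first nucleotide + adjacent pairs
        parts = [1] * 4 + [0] * (n_single - 4) + [1] * n_adj + [0] * n_gap
    elif name == "full":    # R^{36l-48}: everything
        parts = [1] * (n_single + n_adj + n_gap)
    else:
        raise ValueError("unknown subspace: " + name)
    return parts


def restrict(t_vec, mask):
    """Zero out the query-vector elements outside the chosen subspace."""
    return [t * m for t, m in zip(t_vec, mask)]
```

Because the masked-out coordinates of the query are zero, the corresponding coordinates of any l-mer vector do not contribute to the dot product, which is what makes the search effectively take place in the lower-dimensional subspace.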

The NPV method 

We first introduce a simple approach to constructing a
query vector. Let $P$ be the set of $n_+$ binding sites and $N$
be the set of $n_-$ non-binding sites of a particular transcription
factor. We embed all the l-mers in $P$ and $N$ in
$\mathbb{R}^{36l-48}$ and then find the mean binding site vector

$$\boldsymbol{\mu}_+ = \frac{1}{n_+} \sum_{s \in P} \mathbf{s}$$

as well as the mean non-binding site vector

$$\boldsymbol{\mu}_- = \frac{1}{n_-} \sum_{s \in N} \mathbf{s}.$$

The query vector $\mathbf{t}$ is found by subtracting $\boldsymbol{\mu}_-$ from $\boldsymbol{\mu}_+$,
that is, $\mathbf{t} = \boldsymbol{\mu}_+ - \boldsymbol{\mu}_-$. The query vector $\mathbf{t}$ can be seen as
the vector pointing from the center of the non-binding site
vectors to the center of the binding site vectors. Hence,
we call it the negative-to-positive vector (NPV) method.
Figure 2 illustrates the idea.

The score of an l-mer $s$ given by the NPV method is
therefore

$$\mathrm{Score}(s) = \mathbf{s}^T (\boldsymbol{\mu}_+ - \boldsymbol{\mu}_-) = \mathbf{s}^T \boldsymbol{\mu}_+ - \mathbf{s}^T \boldsymbol{\mu}_-. \tag{6}$$

We can see that it computes the similarity between $\mathbf{s}$
and the mean binding site vector as well as the similarity
between $\mathbf{s}$ and the mean non-binding site vector. It
then scores $s$ by the difference of the two similarity scores.
The more similar $\mathbf{s}$ is to the mean binding site vector,
the higher the score. The less similar $\mathbf{s}$ is to the mean
non-binding site vector, the higher the score.
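
As a minimal sketch of the NPV construction, the Python code below computes the two mean vectors and their difference for vectors that are assumed to be already embedded (plain lists of floats). The function names are illustrative and not taken from the authors' implementation.

```python
def mean_vector(vectors):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def npv_query(pos_vecs, neg_vecs):
    """Negative-to-positive vector: mean of TFBS vectors minus mean of non-TFBS vectors."""
    mu_pos = mean_vector(pos_vecs)
    mu_neg = mean_vector(neg_vecs)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def score(s_vec, t_vec):
    """Dot product between an embedded l-mer and the query vector."""
    return sum(a * b for a, b in zip(s_vec, t_vec))
```

For a toy example with binding site vectors `[[1, 0], [1, 1]]` and non-binding site vectors `[[0, 1], [0, 0]]`, the query vector is `[1.0, 0.0]`, so any vector with a larger first coordinate receives a higher score.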

From the perspective of geometry, we note that Score($s$)
in (5) is proportional to $\mathrm{Score}(s)/\|\mathbf{t}\|$, where $\|\mathbf{t}\|$ is the
length of the query vector $\mathbf{t}$. Moreover, by virtue of the
equality

$$\mathbf{s}^T \mathbf{t} = \|\mathbf{s}\| \, \|\mathbf{t}\| \cos\theta,$$

we know $\mathrm{Score}(s)/\|\mathbf{t}\|$ equals the orthogonal projection
of $\mathbf{s}$ onto $\mathbf{t}$, where $\theta$ is the angle formed by vectors $\mathbf{s}$

Figure 2 Illustration of the NPV method. The solid arrow represents
the negative-to-positive vector $\boldsymbol{\mu}_+ - \boldsymbol{\mu}_-$, pointing from $\boldsymbol{\mu}_-$ to $\boldsymbol{\mu}_+$.
The hollow triangles denote the known binding sites, whereas the
circles represent the known non-binding sites. The center of the
binding site vectors is marked by the solid triangle, while the center
of the non-binding site vectors is marked by the solid circle.



and $\mathbf{t}$ (see Figure 3 for an illustration). The computation
of Score($s$) is therefore equivalent to computation of the
orthogonal projection of $\mathbf{s}$ onto $\mathbf{t}$. Similarly, the computation
of Score($s$) in (6) is equivalent to computation of the
orthogonal projection of $\mathbf{s}$ onto $\boldsymbol{\mu}_+ - \boldsymbol{\mu}_-$. In Figure 2, we
observe that vector $\boldsymbol{\mu}_+ - \boldsymbol{\mu}_-$ is pointing to the left and,
projected onto this vector, most of the binding sites are on
the left of the non-binding sites. This implies that most of
the binding sites have a higher score than the non-binding
sites.

The ODV method 

We have described the NPV method, which offers a 
heuristic way of constructing a query vector. We now 
introduce a way of finding an optimal query vector $\boldsymbol{\beta} \in \mathbb{R}^{36l-48}$.
Suppose that $|P| = n_+$ and $|N| = n_-$, that is,
there are $n_+$ binding sites and $n_-$ non-binding sites for




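Figure 3 The orthogonal projection of $\mathbf{s}$ onto $\mathbf{t}$. It can be seen that
the projection of $\mathbf{s}$ onto $\mathbf{t}$, of length $\|\mathbf{s}\| \cos\theta$, is equal to $\mathrm{Score}(s)/\|\mathbf{t}\| \propto \mathrm{Score}(s)$.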



a particular TF. Let $P = \{s_{(1)}, s_{(2)}, \ldots, s_{(n_+)}\}$ and
$N = \{s_{(n_+ + 1)}, s_{(n_+ + 2)}, \ldots, s_{(n)}\}$, where $s_{(i)}$ denotes the $i$-th l-mer
in the union of the two sets and $n = n_+ + n_-$. We find the
optimal $\boldsymbol{\beta}$ by solving the following minimization problem:



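$$\min_{\boldsymbol{\beta},\, b,\, \boldsymbol{\xi}} \quad \frac{1}{2}\|\boldsymbol{\beta}\|^2 + \frac{C}{n_+} \sum_{s_{(i)} \in P} \xi_i + \frac{C}{n_-} \sum_{s_{(i)} \in N} \xi_i \tag{7}$$

subject to

$$\frac{\mathrm{Score}(s_{(i)})}{\|\boldsymbol{\beta}\|} \ge \frac{b + 1 - \xi_i}{\|\boldsymbol{\beta}\|} \quad \text{for } s_{(i)} \in P, \tag{8}$$

$$\frac{\mathrm{Score}(s_{(i)})}{\|\boldsymbol{\beta}\|} \le \frac{b - 1 + \xi_i}{\|\boldsymbol{\beta}\|} \quad \text{for } s_{(i)} \in N, \tag{9}$$

$$\xi_i \ge 0 \quad \forall i. \tag{10}$$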



The constraint in (8) ensures that the projection of a
TFBS onto the vector $\boldsymbol{\beta}$, $\mathrm{Score}(s_{(i)})/\|\boldsymbol{\beta}\|$, exceeds the threshold
$(b + 1)/\|\boldsymbol{\beta}\|$. On the other hand, the constraint in (9) ensures
that the projection of a non-TFBS onto $\boldsymbol{\beta}$ stays below
the threshold $(b - 1)/\|\boldsymbol{\beta}\|$. Flexibility is given to the thresholds by
introducing the $\xi_i$'s with cost captured by the last two terms in
(7). Finally, to clearly distinguish TFBSs from non-TFBSs,
the squared difference between the two thresholds ($(b + 1)/\|\boldsymbol{\beta}\|$
and $(b - 1)/\|\boldsymbol{\beta}\|$) is made as large as possible. This amounts to
maximizing $(2/\|\boldsymbol{\beta}\|)^2$ or, equivalently, minimizing $\frac{1}{2}\|\boldsymbol{\beta}\|^2$,
which is the first term in (7). We call this approach the
optimal discriminating vector (ODV) method.

The optimization problem in (7) is known as a quadratic
programming problem with linear inequality constraints
specified in (8), (9) and (10). There are $p + n + 1$ variables
and $2n$ constraints, where $p = 36l - 48$ is the
dimension of $\boldsymbol{\beta}$. We can see that (8) and (9) specify $n$
constraints whereas (10) imposes $n$ constraints on the
variables. Quadratic programming [28] is well-studied
and hence general solvers are available, e.g., the OpenOpt
framework [29]. To solve this problem, the parameter $C$ ($> 0$)
is first arbitrarily chosen. A solver then searches for values
of $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)^T$, $b$ and $\boldsymbol{\xi} = (\xi_1, \ldots, \xi_n)^T$ such that
the objective function in (7) is minimized while the constraints
in (8), (9) and (10) are satisfied simultaneously. It
can be seen that an optimal solution to (7) always exists
since the search space of $\boldsymbol{\beta}$, $b$, $\boldsymbol{\xi}$ is never empty. To find
a feasible solution, one can arbitrarily pick $\boldsymbol{\beta} \in \mathbb{R}^p$
and $b \in \mathbb{R}$. For $s_{(i)} \in P$, one can pick $\xi_i \in \mathbb{R}$ such that the
constraint in (8) is satisfied. Similarly, for $s_{(i)} \in N$, one can
pick $\xi_i \in \mathbb{R}$ such that the constraint in (9) is met. We can
then compute the value of the objective function as the
values of all the variables are known. One way to choose 
the parameter C in (7) is to search for C in a range by 
cross-validation. The parameter is TF-dependent in gen- 
eral, but experiments showed that a small C = will 
usually suffice and hence we set C = for all the ODV 
experiments in this study. 
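
The feasibility argument above can be made concrete with a small Python sketch. Given an arbitrary non-zero query vector and threshold offset, it computes the smallest non-negative slacks that satisfy constraints (8) and (9) and then evaluates an objective. This is not a solver and not the authors' code: the function names are hypothetical, and the exact form of the objective (a margin term of one half the squared norm plus slack costs weighted by C/n+ and C/n-) is an assumption made for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def minimal_slacks(beta, b, pos_vecs, neg_vecs):
    """Smallest xi_i >= 0 satisfying (8) for TFBSs and (9) for non-TFBSs.

    Multiplying (8) and (9) by ||beta|| > 0 gives
    s.beta >= b + 1 - xi  and  s.beta <= b - 1 + xi.
    """
    xi_pos = [max(0.0, b + 1.0 - dot(s, beta)) for s in pos_vecs]
    xi_neg = [max(0.0, dot(s, beta) - (b - 1.0)) for s in neg_vecs]
    return xi_pos, xi_neg


def objective(beta, xi_pos, xi_neg, C):
    """ASSUMED objective: 0.5*||beta||^2 + C/n+ * sum(xi_pos) + C/n- * sum(xi_neg)."""
    margin_term = 0.5 * dot(beta, beta)
    slack_cost = C * sum(xi_pos) / len(xi_pos) + C * sum(xi_neg) / len(xi_neg)
    return margin_term + slack_cost
```

Any choice of a non-zero vector and offset therefore yields a feasible point with a finite objective value, and a quadratic programming solver improves on such a point.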
The PSSM and ULPB methods 

We briefly describe the ungapped likelihood under posi- 
tional background (ULPB) method proposed in [23] and 
the position-specific scoring matrix (PSSM) method com- 
pared therein. We refer readers to section Notation for
functions and variables used here. Consider a specific TF
with binding sites of length $l$. The PSSM method scores an
l-mer $s$ by

$$\sum_{i=1}^{l} \log [f_i(s_i)], \tag{11}$$

where $s_i$ denotes the $i$-th letter in $s$. We note that usually the
ratio $f_i(s_i)/f(s_i)$ is used in place of $f_i(s_i)$, where $f(s_i)$ is the
background probability of $s_i$. The simpler form in (11) was
compared in [23] and hence it serves as a baseline method
in this study.

The ULPB method models a TFBS by a first-order Markov chain
and models the background by another first-order Markov
chain. The background transition probabilities are estimated
using the entire genome of a species and hence the
ULPB method uses negative examples implicitly. It scores
an l-mer $s$ by

$$\log f_1(s_1) + \sum_{i=1}^{l-1} \log \left( \frac{f_i(s_{i+1} \mid s_i)}{f(s_{i+1} \mid s_i)} \right). \tag{12}$$

Although ULPB does not consider background probability
in the first term of (12), the score is approximately the
log-likelihood ratio of the two Markov chains.
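
The two scoring functions can be sketched in Python as follows. This is an illustrative sketch, not the implementation from [23]: the function names are hypothetical, and the add-one pseudocount used when estimating probabilities is an assumption introduced here to avoid taking the logarithm of zero. The background transition table is passed in by the caller.

```python
import math

NUCS = "ACGT"


def position_freqs(sites, pseudo=1.0):
    """Estimate f_i(u) from aligned sites (pseudocount is an assumption)."""
    freqs = []
    for i in range(len(sites[0])):
        counts = {u: pseudo for u in NUCS}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        freqs.append({u: c / total for u, c in counts.items()})
    return freqs


def transition_probs(sites, pseudo=1.0):
    """Estimate f_i(v | u): v at position i + 1 given u at position i."""
    trans = []
    for i in range(len(sites[0]) - 1):
        counts = {u: {v: pseudo for v in NUCS} for u in NUCS}
        for s in sites:
            counts[s[i]][s[i + 1]] += 1
        trans.append({u: {v: c / sum(row.values()) for v, c in row.items()}
                      for u, row in counts.items()})
    return trans


def pssm_score(lmer, freqs):
    """Sum of log f_i(s_i) over positions, as in (11)."""
    return sum(math.log(freqs[i][c]) for i, c in enumerate(lmer))


def ulpb_score(lmer, freqs, trans, bg_trans):
    """log f_1(s_1) plus the log-ratios of site and background transitions, as in (12)."""
    total = math.log(freqs[0][lmer[0]])
    for i in range(len(lmer) - 1):
        u, v = lmer[i], lmer[i + 1]
        total += math.log(trans[i][u][v] / bg_trans[u][v])
    return total
```

In practice the background transitions would be estimated from genome-wide dinucleotide counts; a uniform table (0.25 for every transition) is enough to try the functions on toy data.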

The main difference between the PSSM method and 
the NPV, ODV and ULPB methods is that the PSSM 
method does not score nucleotide pairs nor does it utilize 
a background distribution. The NPV and ODV methods 
explicitly take advantage of negative binding sites, while 
the ULPB method does it implicitly by using a background 
distribution. The flexibility of the proposed framework 
allows the NPV and ODV methods to easily search in 
subspaces, further distinguishing the PSSM and ULPB 
methods from the proposed ones. 

Results and discussion 

Performance assessment and evaluation metrics 

The performance of a TFBS search method is evaluated
by $v$-fold cross-validation (CV). Consider a TF with $n_+$
TFBSs of length $l$ with flanking regions on both sides. A
set of negative examples, $N_{\text{test}}$, called the test negatives is
constructed from the TFBSs of the other TFs with filtering
as in [22]. Another set of negative examples, $N_{\text{train}}$, called
the training negatives is collected from sequences embedding
the binding sites. It comprises all the l-mers
except for the TFBSs and two neighboring l-mers of each
TFBS.



The $n_+$ TFBSs are first divided into $v$ sets, each of which
contains $\lfloor n_+/v \rfloor$ or $\lfloor n_+/v \rfloor + 1$ TFBSs. At each iteration of
$v$-fold CV, one of the $v$ TFBS sets, called the test TFBS set
$P_{\text{test}}$, is left out. The rest of the TFBSs are therefore called
the training TFBSs. A scoring function is obtained using
the training TFBSs and non-TFBSs randomly sampled
from the training negatives, where the ratio of numbers of
non-TFBSs to TFBSs is set to 10. The test TFBSs in $P_{\text{test}}$
along with the non-TFBSs in $N_{\text{test}}$ are then scored by the
scoring function. To score a test sequence, both the forward
and reverse strands are scored and, in case the test
sequence is longer or shorter than $l$, the l-mer producing
the highest score is used. For each test TFBS $t \in P_{\text{test}}$,
we find its rank relative to all the non-TFBSs in $N_{\text{test}}$. Formally,
the rank of $t$ equals
$1 + |\{s \in N_{\text{test}} \mid \mathrm{Score}(s) > \mathrm{Score}(t)\}|$.
After the $v$-fold CV, we end up with $n_+$ ranks, each of 
which corresponds to a TFBS. To allow comparison of 
methods, we use the area under the ROC curve (AUC) to 
gauge the performance of a method on the TF. The ROC 
curve is a plot of true positive rate (TPR) against false 
positive rate (FPR), displaying the trade-off between TPR 
and FPR. We refer readers to [30] for an introduction to 
this metric. In this study, $v = 10$ for all the CV experiments.
For the NPV and ODV methods, the best weight
and subspace combination is obtained at each iteration of
the $v$-fold CV. Specifically, another $(v - 1)$-fold CV is performed
on the $v - 1$ sets of TFBSs to search for the best
combination.
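
The evaluation steps can be sketched in Python as follows. The fold split and the rank follow the definitions in the text; the conversion from ranks to AUC is one plausible reading and is an assumption of this sketch, in which a tie between a TFBS and a non-TFBS counts in favor of the TFBS because the rank only counts strictly higher-scoring non-TFBSs. All function names are hypothetical.

```python
def split_folds(items, v):
    """Divide items into v sets of size floor(n/v) or floor(n/v) + 1."""
    return [items[i::v] for i in range(v)]


def rank_of(test_score, negative_scores):
    """Rank of a test TFBS: 1 + number of non-TFBSs scoring strictly higher."""
    return 1 + sum(1 for s in negative_scores if s > test_score)


def auc_from_ranks(ranks, n_negatives):
    """Fraction of (TFBS, non-TFBS) pairs in which the non-TFBS does not outscore the TFBS.

    ASSUMPTION: ties count in favor of the TFBS, consistent with the strict
    inequality in the rank definition.
    """
    wins = sum(n_negatives - (r - 1) for r in ranks)
    return wins / (len(ranks) * n_negatives)
```

For instance, a TFBS scoring 0.9 against non-TFBS scores 0.1, 0.5 and 0.95 has rank 2, and ranks 1 and 2 against three non-TFBSs give an AUC of 5/6 under this convention.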

Prokaryotic transcription factor binding sites 

To understand the behavior of search methods on 
prokaryotic TF binding sites, we conducted 10-fold cross- 
validation experiments on the 26-TF RegulonDB data set. 
The proposed NPV and ODV methods were compared to 
the ULPB method [23]. The PSSM method, considered in 
[23], was also included for comparison since it served as a 
simple baseline method. 

Figure 4a shows the plot of area under the ROC curve 
(AUC) across the 26 TFs for each method. We can see 
that the ODV method has the best AUC on 12 out of 
26 TFs and the NPV method has the best AUC on 9 
out of 26 TFs whereas the ULPB and PSSM methods 
have the best AUC on 1 and 4 TFs, respectively. To 
gauge the relative performance between two methods, 
statistical tests [31] were performed on all the 6 pairs 
of methods. Figure 4b shows the p-values of the pairwise
comparisons. We first notice that, consistent with
the results in [23], ULPB outperformed PSSM, although the
p-value of 0.0679 is slightly above the usual 0.05
significance cut-off. As seen in Figure 4b, the NPV and 
ODV methods are significantly better than the PSSM and 
ULPB methods. We can see that the ODV method ben- 
efited from optimization albeit minimizing the objective 
Figure 4 Comparison of the PSSM, ULPB, NPV and ODV methods on the RegulonDB data set. (a) Plot of AUC values across the 26 prokaryotic 
TFs for each method. (b) Matrix of p-values from pair-wise comparisons. A red solid circle in row i and column j indicates that method i 
outperformed method j, while a blue one in row i and column j indicates that method i is inferior to method j. The size and darkness of a circle imply 
the significance of the relationship between two methods. The larger and darker a circle, the more significant the relationship. White background 
indicates exceeding the usual 0.05 significance cut-off, while gray background indicates the opposite. 



function in (7) does not guarantee maximization of the 
AUC. 

Eukaryotic transcription factor binding sites 

Here we compare the proposed NPV and ODV methods 
to the ULPB and PSSM methods on eukaryotic TF bind- 
ing sites. As in the previous section, we conducted 10-fold 
cross-validation experiments on the 28-TF JASPAR data 
set. Figure 5a shows the plot of AUC across the 28 TFs 
for each method. We can see that both the ODV and NPV 
methods have the best AUC on 13 out of 28 TFs while the 
ULPB and PSSM methods have the best AUC on 6 and 4 
TFs, respectively. All the methods have the best AUC of
1 on MA0149.1 and MA0115, while the ODV, NPV and
PSSM methods have the best AUC of 0.999 on MA0137.2.

Similarly, statistical tests [31] were performed on all the 
6 pairs of methods. Figure 5b shows that the NPV and 
ODV methods are significantly better than the PSSM and 
ULPB methods. ULPB is significantly better than PSSM, 
which is again consistent with the results reported in 
[23]. Overall, the relative performance of the four methods
remains unchanged as we shift from prokaryotic transcription
factors to eukaryotic ones. This suggests that a TFBS search
method effective on prokaryotic transcription factors is
likely to be effective on eukaryotic transcription factors
as well, and vice versa.
Figure 5 Comparison of the PSSM, ULPB, NPV and ODV methods on the JASPAR data set. (a) Plot of AUC values across the 28 eukaryotic TFs 
for each method. (b) Matrix of p-values from pair-wise comparisons. A red solid circle in row i and column j indicates that method i outperformed 
method j, while a blue one in row i and column j indicates that method i is inferior to method j. The size and darkness of a circle imply the 
significance of the relationship between two methods. The larger and darker a circle, the more significant the relationship. White background 
indicates exceeding the usual 0.05 significance cut-off, while gray background indicates the opposite. 



Motif subtype identification in vector spaces 

It has been shown that the binding sites of a TF can be 
better represented by 2 motif subtypes than by a sin- 
gle motif [32,33]. In search for new binding sites, two 
position-specific scoring matrices are used to score an l-mer
and the higher score of the two is assigned to this
l-mer. Searching with two PSSMs was shown to be supe- 
rior to searching with a single PSSM by cross-species 
conservation statistics in these studies. 

We demonstrate that motif subtypes can be readily
identified once we embed l-mers in a vector space. The
purpose here, however, is not to compare motif subtype
identification algorithms. We adopted a slightly different
approach to motif subtype identification from those in
previous work [32,33], while the idea is similar. As usual,
all the l-mers were first embedded in a vector space.
The known binding sites of a TF were clustered into
two subtypes by the k-means algorithm [34]. Immediately,
we have a variant of the NPV method called the kNPV
method, where $k = 2$ denotes the number of motif subtypes.
The kNPV method first computes the mean vectors
of these two subtypes, $\boldsymbol{\mu}_{+1}$ and $\boldsymbol{\mu}_{+2}$, and then scores an l-mer
$s$ by

$$\mathrm{Score}(s) = \max \left\{ \mathbf{s}^T (\boldsymbol{\mu}_{+1} - \boldsymbol{\mu}_-),\ \mathbf{s}^T (\boldsymbol{\mu}_{+2} - \boldsymbol{\mu}_-) \right\},$$

Figure 6 Illustration of the kNPV method. The solid arrows
represent the negative-to-positive vectors $\boldsymbol{\mu}_{+1} - \boldsymbol{\mu}_-$ and $\boldsymbol{\mu}_{+2} - \boldsymbol{\mu}_-$,
pointing from $\boldsymbol{\mu}_-$ to $\boldsymbol{\mu}_{+1}$ and $\boldsymbol{\mu}_{+2}$, respectively. The hollow triangles
denote the known binding sites, whereas the circles represent the
known non-binding sites. The centers of the binding site vectors are
marked by the solid triangles, while the center of the non-binding site
vectors is marked by the solid circle.



where $\boldsymbol{\mu}_-$ is the mean vector of the non-binding sites.
Figure 6 illustrates the kNPV method.
Similarly, the kODV method scores an l-mer $s$ by



Score(s) = m^x\s^P^J\\P^^\ls'p^2/\\P 



+21 



where $\beta_{+i}$ is obtained using TFBSs in cluster i, i = 1, 2.
Unlike in the kNPV method, the lengths of the $\beta_{+i}$'s may be very
different and hence the $\beta_{+i}$'s are scaled to unit vectors so as
not to bias the scoring function. We note that the choice
of k = 2 came from previous studies [32,33]. Generally, k
can be greater than 2 or even automatically selected [35].
This, however, is beyond the scope of this study and may
be investigated in the future.
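The effect of scaling the discriminating vectors to unit length can be seen in a short Python sketch. The per-cluster vectors are taken as given inputs here, since how each one is optimized is defined by the ODV method itself; the function names `unit` and `kodv_score` are hypothetical.

```python
import math


def unit(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def kodv_score(s, betas):
    """Return max_i s^T beta_i / ||beta_i|| over the per-cluster vectors."""
    return max(sum(a * b for a, b in zip(s, unit(beta))) for beta in betas)
```

For instance, with `s = [1.0, 2.0]` and per-cluster vectors `[10.0, 0.0]` and `[0.0, 0.1]`, the unscaled inner products would be 10 and 0.2, so the first cluster would win only because its vector is longer. After scaling, the projections are 1 and 2, and the second cluster, whose direction matches s better, determines the score.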

We assessed the kNPV and kODV methods by 10-fold
cross-validation on both the RegulonDB and JASPAR
data sets. Figure 7 shows the results in terms of AUC.
We observe in Figure 7a that, overall, introducing motif
subtypes into the NPV and ODV methods improves
the search performance (p-values: 6.41 x 10~^ and
8.31 x 10~^, respectively). Results in Figure 7b also
support this observation (p-values: 1.61 x 10~^ and
3.04 x 10-^, respectively). The kNPV and kODV methods are
comparable on both the RegulonDB and JASPAR data
sets (p-values: 0.197 and 0.47, respectively). These results
are consistent with those reported in [32,33].

Independent validation on ChIP-seq data

To evaluate the proposed NPV and ODV methods on the
whole-genome scale, we built TF models using TFBSs in
the JASPAR database to scan all the human (build hg19)
1000-base promoter sequences obtained from the UCSC
Genome Browser database [36]. ChIP-seq peaks from the
ENCODE project were also retrieved [37]. Specifically,
the wgEncodeRegTfbsClusteredV2 table of build hg19 was
obtained. We checked TFs in Table 2 against the annotations
and found 14 JASPAR TFs, recognized by 17 antibodies,
present in the ENCODE annotations. The mapping
is listed in the first 3 columns of Table 3.

For the NPV and ODV methods, the best weight 
and subspace combination was found by 5-fold cross- 
validation on the JASPAR TFBSs, while flanking genomic 







Table 3 Results of independent validation on ChIP-seq data
Subspaces (S) $\mathbb{R}^{4l}$, $\mathbb{R}^{16l-12}$ and $\mathbb{R}^{36l-48}$ are denoted by 1, 2 and 3, respectively.



sequences of the TFBSs were the source of negative
examples. To assess the 4 compared methods, we considered
the part of a ROC curve where FPR is at most 0.01
and calculated the AUC scaled to between 0 and 1. This is
nearly equivalent to allowing at most 10 false positive hits
per promoter on average. As a peak spans about 200 bases,
it is considered recalled when it fully contains a predicted
binding site. Similarly, a predicted binding site must be
fully covered by a peak to be a true positive hit.
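The restricted-range AUC used here can be illustrated with a small Python sketch. It integrates the ROC curve by the trapezoidal rule up to the FPR cut-off and divides by the cut-off so that a perfect ranking scores 1. Both the linear interpolation at the cut-off and this particular scaling are assumptions made for illustration, since the text only states that the area is scaled to between 0 and 1; the function name `partial_auc` is hypothetical. The cut-off of 0.01 matches the remark about 10 false positives if a 1000-base promoter is taken to contain roughly 1000 candidate l-mers.

```python
def partial_auc(scores, labels, max_fpr=0.01):
    """Area under the ROC curve for FPR in [0, max_fpr], divided by max_fpr.

    `labels` holds 1 for true binding sites and 0 for non-binding sites.
    """
    if not 0.0 < max_fpr <= 1.0:
        raise ValueError("max_fpr must be in (0, 1]")
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")

    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    area = 0.0
    tp = fp = 0
    prev_fpr = prev_tpr = 0.0
    i = 0
    while i < len(ranked):
        # Consume a block of tied scores as a single ROC step.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        fpr, tpr = fp / n_neg, tp / n_pos
        if fpr >= max_fpr:
            # Interpolate linearly to the cut-off, add the last trapezoid, stop.
            frac = (max_fpr - prev_fpr) / (fpr - prev_fpr)
            cut_tpr = prev_tpr + frac * (tpr - prev_tpr)
            area += (max_fpr - prev_fpr) * (prev_tpr + cut_tpr) / 2
            return area / max_fpr
        area += (fpr - prev_fpr) * (prev_tpr + tpr) / 2
        prev_fpr, prev_tpr = fpr, tpr
        i = j
    # Defensive fallback; the loop returns once FPR reaches max_fpr.
    return area / max_fpr
```

With `max_fpr=1.0` the function reduces to the ordinary AUC, which gives a simple way to sanity-check it against a full-curve computation.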

In Table 3, we observe that ODV, NPV, ULPB and PSSM
produced the best AUC on 13, 1, 1 and 3 out of 18
tests, respectively. Statistical tests showed that ODV
significantly outperformed the other 3 methods (p-values <
0.0028), NPV significantly outperformed ULPB and PSSM
(p-values < 0.0449), and ULPB and PSSM are comparable
(p-value: 0.433). We notice that both NPV and ODV
performed worse than the other two methods on MEF2A.
As NPV and ODV both sample negative examples from
flanking sequences of TFBSs, we suspect that this is one
example where the flanking sequences do not represent
the entire promoters well. ODV performed consistently
across tests corresponding to the same JASPAR ID such as
the three for CTCF. Examining the best weight and subspace,
we can see that the subspace agrees on 11 out of
14 TF models, while the weight agrees on only 7 of them.
The latter may be because ODV optimizes the $\beta$ vector
and hence is less sensitive to the weight used to embed an
l-mer.



Conclusions 

In this work, we proposed to search for transcription
factor binding sites in vector spaces. The novel
NPV and ODV methods were introduced to construct
a query vector to search for binding sites of a TF. We
compared our methods to a state-of-the-art method,
the ULPB method, and the widely-used PSSM method.
Cross-validation experiments revealed that the NPV and
ODV methods significantly outperformed the ULPB and
PSSM methods on prokaryotic as well as eukaryotic
TF binding sites. Independent validation on human
ChIP-seq data further verified that the NPV and ODV
methods are significantly better than the other compared
methods.

One of the advantages of our framework is that it allows 
one to easily search for binding sites in various sub- 
spaces. Hence, one can search in the best subspace for 
each individual TF since one can hardly find an opti- 
mal subspace for all the TFs. Another advantage is that 
under the proposed framework one can readily identify 
motif subtypes for a TF. Hence, to exploit this advantage, 
we introduced the kNPV and kODV methods, immediate
variants of the NPV and ODV methods. We demonstrated
that, consistent with results in previous studies, kNPV
(kODV) significantly improved on NPV (ODV) on the two
data sets. 

Our future work aims to extend the proposed
methods to handle known binding sites of variable
lengths. We will seek to approach this problem without
resorting to multiple sequence alignment, which is notoriously
time-consuming. In the meantime, we will also seek
to identify additional promising subspaces in which to
search for TF binding sites.